WEBVTT Kind: captions; Language: en 00:00:09.001 --> 00:00:14.000 Good afternoon, everybody. It's lovely to see you all here today joining us for 00:00:14.000 --> 00:00:18.001 the second session in our series with the Internet Archive, D-Web, and Library 00:00:18.001 --> 00:00:23.000 Futures. The entire series is titled Imagining a Better Online World, Exploring 00:00:23.000 --> 00:00:26.001 the Decentralized Web. And today we'll be talking about using the decentralized 00:00:26.001 --> 00:00:31.000 storage to keep your materials safe. My name is Davis Erin Anderson. I'm 00:00:31.000 --> 00:00:35.000 assistant director for programs and partnerships at Metro. Please say hello in 00:00:35.000 --> 00:00:39.000 chat. We'd love to know who's out there, where are you from, what's your name, 00:00:39.001 --> 00:00:42.001 what's your interest in this topic. We'd love to know who's in the audience and 00:00:42.001 --> 00:00:46.001 hear from you a little bit as we get started. So Metro is a multi-type 00:00:46.001 --> 00:00:50.001 consortium. We serve the five boroughs of New York City and Westchester County. 00:00:50.001 --> 00:00:54.001 We're a service provider. We do events like this and partnership programs like 00:00:54.001 --> 00:00:57.001 the one you're attending today. We have a group that works on software 00:00:57.001 --> 00:01:01.001 development. We provide delivery services and make sure that knowledge can be 00:01:01.001 --> 00:01:05.001 spread equitably throughout our service area. So we really care a lot about the 00:01:05.001 --> 00:01:10.000 future of how information moves. And so we are pleased and honored today to 00:01:10.000 --> 00:01:13.001 support the work that folks are doing at Internet Archive. We wanted to hear a 00:01:13.001 --> 00:01:18.000 little bit more about what they envision for the future of the web. So we're 00:01:18.000 --> 00:01:22.000 running a six-part series. This is the second part. 
We'll drop a link into chat 00:01:22.000 --> 00:01:26.000 that lets you see where to go to register for our upcoming sessions as well as 00:01:26.000 --> 00:01:30.001 check back on the resources we're providing for this one in the past sessions as 00:01:30.001 --> 00:01:34.001 well. So if you would please drop your questions into chat and your comments as 00:01:34.001 --> 00:01:38.001 well. We had a really robust and active conversation going for our first session 00:01:38.001 --> 00:01:42.001 and we'd love to see that happen here again. We're also providing resource guides 00:01:42.001 --> 00:01:47.000 to go with each and every one of our six parts of the series. So please also look 00:01:47.000 --> 00:01:51.001 at chat for a link to the current guide and please stay tuned to your inbox. If 00:01:51.001 --> 00:01:55.001 you registered, you'll receive a PDF copy as well. So it's my pleasure to 00:01:55.001 --> 00:01:58.001 introduce you to Wendy Hanamura. Wendy is 00:01:58.001 --> 00:02:00.000 Director of Partnerships at Internet Archive. 00:02:00.001 --> 00:02:05.001 She planned the first ever decentralized web summit a few years back. In the past 00:02:05.001 --> 00:02:09.000 six years, she's helped to guide the global growth of the decentralized web. So 00:02:09.000 --> 00:02:13.001 she's really the expert on this topic. And she's here today to co-produce the six 00:02:13.001 --> 00:02:17.001 -part series, Imagining a Better Online World, Exploring the Decentralized Web. So 00:02:17.001 --> 00:02:21.000 thank you so much, Wendy. Over to you. And it's great to see you again. Thank 00:02:21.000 --> 00:02:24.000 you, Davis. And thanks to all of you for being here today. I'm seeing friends 00:02:24.000 --> 00:02:29.000 from Berlin and Argentina, many, many from New York and Florida. We're so happy 00:02:29.000 --> 00:02:34.000 that you can be here to learn a little bit about decentralized storage. 
Now in 00:02:34.000 --> 00:02:38.000 this webinar, we're going to be exploring with you a new set of decentralized 00:02:38.000 --> 00:02:44.000 technologies that may help you to preserve and provide access to your media. So 00:02:44.000 --> 00:02:48.001 here's the game plan for the next 60 minutes. I'm going to start by giving us an 00:02:48.001 --> 00:02:53.000 overview of some of the problems that decentralized storage could help to solve. 00:02:54.000 --> 00:02:58.000 Then I have invited a friend of mine, the founder of Starling Lab, to share with 00:02:58.000 --> 00:03:03.000 you how his group is working with many, many cultural institutions to keep their 00:03:03.000 --> 00:03:08.001 most critical and important materials safe. We also want to show you this tech in 00:03:08.001 --> 00:03:13.000 action. So I've invited two people to demonstrate what they've been working on. 00:03:13.001 --> 00:03:18.000 First, an engineer of ours from the Internet Archive is going to be showing you 00:03:18.000 --> 00:03:24.001 how we've been experimenting with saving web archives at scale in Filecoin. And a 00:03:24.001 --> 00:03:29.000 senior engineer from Storj, the decentralized storage company, is going to show 00:03:29.000 --> 00:03:34.001 you how we've been storing LibriVox audiobooks in decentralized storage. Now both 00:03:34.001 --> 00:03:39.000 of these collections, the web archives and the audiobooks, were created 00:03:39.000 --> 00:03:43.001 collaboratively by communities. And I think that's the real promise here, that 00:03:43.001 --> 00:03:48.001 you could take collaborative collections and perhaps store them and preserve them 00:03:48.001 --> 00:03:53.000 collaboratively as well. So let's start by thinking about some of the challenges. 00:03:54.000 --> 00:03:58.000 Now many of you are archivists, you're librarians, you run cultural institutions, 00:03:58.000 --> 00:04:04.001 so this is very familiar. 
Your collections are ever expanding in the physical 00:04:04.001 --> 00:04:07.001 world, but also in the digital realm. 00:04:08.000 --> 00:04:13.001 Digital objects may be even harder to store, right? How do you keep things safe, 00:04:13.001 --> 00:04:19.000 not only from floods and fires, but also secure from hackers? How do you make 00:04:19.000 --> 00:04:25.000 them accessible in a time when there are broken links and content drift? How do 00:04:25.000 --> 00:04:30.001 you make sure that your data is trustworthy, especially in an era when deepfakes 00:04:30.001 --> 00:04:36.000 are growing? Then there's the scale of digital holdings, which seem to be 00:04:36.000 --> 00:04:42.000 enormous. And isn't it true that weeding digital objects feels a little bit wrong 00:04:42.000 --> 00:04:48.001 since they're just bits? How do you weed ever-growing digital collection? And 00:04:48.001 --> 00:04:54.000 what about the long-term preservation, the sustainability of this collection? How 00:04:54.000 --> 00:04:56.001 do you do digital storage in centuries? 00:04:57.001 --> 00:05:04.000 And let's not forget the issue of cost. It is so hard to predict the future costs 00:05:04.000 --> 00:05:08.000 of decentralized storage, especially when technology is changing all the time. 00:05:09.000 --> 00:05:15.001 Now that takes us to this. Think of the decentralized web as a stack with every 00:05:15.001 --> 00:05:20.001 layer of the web stack potentially decentralized. When you take all of these 00:05:20.001 --> 00:05:25.001 decentralized technologies together, that's what we call the decentralized web. 00:05:25.001 --> 00:05:30.000 And you'll notice in this diagram that the bottom layer is decentralized storage. 00:05:30.000 --> 00:05:35.000 That's the layer we're going to be exploring today. Conceptually, decentralized 00:05:35.000 --> 00:05:41.001 storage allows you to store your data across a peer-to-peer network of servers. 
00:05:42.000 --> 00:05:45.001 But so does Amazon Cloud, right? So what's the difference? 00:05:45.001 --> 00:05:50.000 I would say that the difference here is really that not only is your storage 00:05:50.000 --> 00:05:57.000 location distributed, but also your storage management is decentralized. That 00:05:57.000 --> 00:06:02.000 way you can't take out just one central control entity like Amazon and have the 00:06:02.000 --> 00:06:05.001 entire system go down. So what is the promise? 00:06:06.000 --> 00:06:10.000 What does decentralized storage offer? Well, first there's the concept of 00:06:10.000 --> 00:06:14.000 resiliency. Now, we're very familiar with that in the library world. There's 00:06:14.000 --> 00:06:19.000 LOCKSS: Lots of Copies Keep Stuff Safe. So we know that if you distribute copies 00:06:19.000 --> 00:06:23.001 across different geographic lines, geopolitical lines, it's going to be safer. 00:06:24.001 --> 00:06:28.001 Then there's the concept of persistence. Now, this is something that a lot of 00:06:28.001 --> 00:06:32.001 people get wrong when they think about the decentralized web. Just because you 00:06:32.001 --> 00:06:38.001 cut up a file and put pieces of it in different servers does not mean that those 00:06:38.001 --> 00:06:44.000 servers are guaranteed to keep your files forever. Now, persistence would mean 00:06:44.000 --> 00:06:48.001 that you'd have to have a guarantee somehow built in that the people who hold 00:06:48.001 --> 00:06:54.000 your copies will hold them forever or for a long time. So how do you ensure 00:06:54.000 --> 00:06:58.001 persistence? Well, in truth, I don't think we're really sure about that. But 00:06:58.001 --> 00:07:04.000 organizations like Filecoin and Storj are using a combination of incentives and 00:07:04.000 --> 00:07:11.000 shared protocols and contracts to try to ensure persistence. 
Next, I think 00:07:11.000 --> 00:07:15.001 this step, self-certification, is the most important attribute 00:07:15.001 --> 00:07:17.000 of decentralized storage. 00:07:18.001 --> 00:07:25.001 You know, here every item is assigned a unique immutable hash, a persistent ID. 00:07:26.000 --> 00:07:31.000 And you use this ID to find your things wherever they are and to check how many 00:07:31.000 --> 00:07:38.000 people have copies of them. So this is something 00:07:38.000 --> 00:07:40.000 we call content addressing. 00:07:40.001 --> 00:07:44.001 And in Web 2.0, you find things based on where they're located, right? You have a 00:07:44.001 --> 00:07:49.000 URL that takes you to a place on a server. Well, in Web 3.0 or the decentralized 00:07:49.000 --> 00:07:55.001 web, the ID remains with the content itself. And if the content changes, so does 00:07:55.001 --> 00:08:01.000 the hash. So anytime something is altered, you get a new hash. And ostensibly, 00:08:01.001 --> 00:08:05.001 the self-certification is what allows you to ensure the provenance and 00:08:05.001 --> 00:08:11.001 authenticate an item. Finally, there is the goal of interoperability. 00:08:12.000 --> 00:08:16.000 I think it's pretty true that right now we have a lot of 00:08:16.000 --> 00:08:18.000 silos where our materials live. 00:08:18.001 --> 00:08:21.001 And when you want to work collaboratively on a shared data 00:08:21.001 --> 00:08:23.000 set, that can be very problematic. 00:08:23.001 --> 00:08:29.000 Now in the utopian version of decentralized storage, you can have collaborative, 00:08:29.000 --> 00:08:35.000 authenticated, co-hosted collections. And these collections would be less prone 00:08:35.000 --> 00:08:39.001 to censorship because you can't block just one URL and block the entire 00:08:39.001 --> 00:08:45.000 collection. They're also perhaps harder to hack because there's not one single 00:08:45.000 --> 00:08:49.000 honeypot to go after. 
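The content addressing idea described above can be sketched in a few lines. This is a minimal illustration, assuming a plain SHA-256 hex digest as the identifier; real systems such as IPFS use a richer CID format, and the `ContentStore` class and sample byte strings here are hypothetical names chosen for the example.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Return a SHA-256 hex digest, used here as a stand-in for a content ID."""
    return hashlib.sha256(data).hexdigest()


class ContentStore:
    """Toy content-addressed store: items are found by digest, not by location."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        cid = content_id(data)
        self._blobs[cid] = data
        return cid

    def get(self, cid: str) -> bytes:
        data = self._blobs[cid]
        # Self-certification: re-hash on read and compare with the requested ID.
        if content_id(data) != cid:
            raise ValueError("stored bytes do not match their content ID")
        return data


store = ContentStore()
original = b"Testimony transcript, version 1"
altered = b"Testimony transcript, version 2"

cid_original = store.put(original)
cid_altered = store.put(altered)  # any change to the bytes yields a different ID
```

Because the ID is derived from the bytes themselves, anyone holding a copy can recompute it and confirm that the copy matches, regardless of which server supplied it.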
They may be easier to share. 00:08:49.001 --> 00:08:53.001 Taken together, resiliency, persistence, self-certification, and 00:08:53.001 --> 00:08:58.001 interoperability, that is the promise of decentralized storage. But it is still 00:08:58.001 --> 00:09:03.000 early days. So whether or not we can deliver on those things is something we're 00:09:03.000 --> 00:09:08.000 testing. Now it is my deep pleasure to bring on Jonathan Dotan. He's the founder 00:09:08.000 --> 00:09:14.000 of Starling Lab, which is the first major research laboratory devoted to Web 3.0 00:09:14.000 --> 00:09:19.000 technologies. It's affiliated with Stanford and USC. And I know that Starling has 00:09:19.000 --> 00:09:23.001 been working for quite a while with the Shoah Foundation to make sure that 00:09:23.001 --> 00:09:29.000 Holocaust testimony videos are kept safe and persistent. But here's a fun fact. I 00:09:29.000 --> 00:09:34.001 first met Jonathan Dotan back in 2018 when he was the consultant for HBO's Silicon 00:09:34.001 --> 00:09:39.001 Valley. And it was Jonathan Dotan who convinced the showrunners to introduce a 00:09:39.001 --> 00:09:43.001 storyline about a new internet, a decentralized internet. 00:09:43.001 --> 00:09:48.000 And that's how he came to be involved with us at the D-Web community. So welcome, 00:09:48.000 --> 00:09:52.000 Jonathan Dotan, founder of Starling Lab. Thanks so much, Wendy, for having me. 00:09:52.001 --> 00:09:56.000 And to the entire community that's assembled here, I can't think of a more 00:09:56.000 --> 00:09:59.001 appropriate group of folks to be speaking to about decentralized storage because 00:09:59.001 --> 00:10:06.000 certainly the power of archiving institutions and libraries and providing a new 00:10:06.000 --> 00:10:12.001 layer of trust for communities in preservation is unique. 
And 00:10:12.001 --> 00:10:16.001 I'm really excited to help bring you into the fold to help answer any questions 00:10:16.001 --> 00:10:20.001 and potentially even inspire you on the possibilities. At the Starling Lab, we've 00:10:20.001 --> 00:10:24.001 been working on what we call a framework for data integrity that allows you end 00:10:24.001 --> 00:10:30.001 to end to think about how you capture, store, and verify information. And the 00:10:30.001 --> 00:10:35.001 page that we really are working from here is one that was written many years ago. 00:10:36.001 --> 00:10:41.001 So I want to start with a little bit of context today, share with you a prototype 00:10:41.001 --> 00:10:45.001 of some of the early work that we've done, and then get into some of the 00:10:45.001 --> 00:10:47.001 learnings and how they might apply over to 00:10:47.001 --> 00:10:49.000 you and some of your archival use cases. 00:10:50.001 --> 00:10:55.000 So to begin, Wendy's talked a little bit about the goals of decentralization, but 00:10:55.000 --> 00:10:59.001 I want to start even a little bit more upstream from there and just get into a 00:10:59.001 --> 00:11:04.001 very simple but also, I think, poignant understanding of how decentralization 00:11:04.001 --> 00:11:10.001 works. In the prior view of communication systems, when let's say AT&T as an 00:11:10.001 --> 00:11:16.000 example was a dominant form of guaranteeing communications infrastructure, you 00:11:16.000 --> 00:11:20.001 had to rely on AT&T as a centralized node, right? Basically, all the data went 00:11:20.001 --> 00:11:25.001 into AT&T and then went out of AT&T in order to be passed along in the case of 00:11:25.001 --> 00:11:30.001 something like voice communications. The distributed web was something that was 00:11:30.001 --> 00:11:35.000 not actually done, let's say, in the 90s when the web took off. It actually was a 00:11:35.000 --> 00:11:40.000 part of the original architecture of the internet. 
And when Paul Baran over at 00:11:40.000 --> 00:11:44.001 RAND set forth how a digital network of this kind might work, he 00:11:44.001 --> 00:11:49.000 explained that what was critical was to establish nodes that were interoperable. 00:11:49.000 --> 00:11:53.000 So that meant that if any of those nodes went down, they could be replaced and 00:11:53.000 --> 00:11:56.000 their function performed by the others, that there was neutrality, that the network itself 00:11:56.000 --> 00:12:00.000 didn't discriminate, but simply passed along the information. And that meant that 00:12:00.000 --> 00:12:05.000 you could allow the end user to hold all of the intelligence and the rich 00:12:05.000 --> 00:12:08.001 applications that might exist on this type of network. 00:12:09.000 --> 00:12:14.001 And so this concept was really baked into a larger philosophical framework that 00:12:14.001 --> 00:12:19.000 was put into play by Doug Engelbart and the team over at the Augment Intelligence 00:12:19.000 --> 00:12:23.000 Framework. And what it posited is that for a decentralized network of knowledge 00:12:23.000 --> 00:12:30.000 to exist, you needed to have at the end computers that could be used by the 00:12:30.000 --> 00:12:34.000 average user. So a personal computer and the concept of the internet were 00:12:34.000 --> 00:12:37.000 actually birthed out of the same framework. That's not actually commonly 00:12:37.000 --> 00:12:42.001 understood, but it makes sense on many levels. And as you look at the history of 00:12:42.001 --> 00:12:47.000 the web as it took off in the early 90s, it's no accident that the first HTTP 00:12:47.000 --> 00:12:49.000 server was actually a personal computer. 00:12:49.001 --> 00:12:54.000 And funny enough, this is Tim Berners-Lee's machine over at CERN. You can see 00:12:54.000 --> 00:12:57.001 there's a sticker that said, to clarify that this is not just a personal 00:12:57.001 --> 00:13:01.000 computer, but it's actually a server. 
And he explained to people that they 00:13:01.000 --> 00:13:08.000 shouldn't turn off this machine because it was basically hosting information. So 00:13:08.000 --> 00:13:12.000 from that, those early promising visions around decentralized technologies and 00:13:12.000 --> 00:13:16.001 how one might want to build a global internet infrastructure, there's a growing 00:13:16.001 --> 00:13:22.000 realization that now, as the internet has taken off and is working at scale with so 00:13:22.000 --> 00:13:26.001 many parts of our lives touching it, something really is amiss. That as the 00:13:26.001 --> 00:13:30.000 internet has become more centralized and dominated by corporations that have 00:13:30.000 --> 00:13:35.001 often monopolistic practices, that there's been tremendous issues with 00:13:35.001 --> 00:13:39.000 establishing trust in this type of system, and that we might want to return to 00:13:39.000 --> 00:13:42.001 decentralization to restore trust and to 00:13:42.001 --> 00:13:44.001 restore a sense of fairness within the internet. 00:13:44.001 --> 00:13:48.001 So that's where a lot of these ideas are coming from, to be clear. This is 00:13:48.001 --> 00:13:53.001 actually how the original internet was designed and meant to be. The 00:13:53.001 --> 00:13:58.001 decentralized web summit in 2018, which is how I got really into the fold, was a 00:13:58.001 --> 00:14:02.001 program that was hosted by Wendy and Brewster over at the Internet Archive. And it 00:14:02.001 --> 00:14:06.000 was really transformative in bringing a number of different individuals together 00:14:06.000 --> 00:14:12.000 to think about these issues. 
And it's important to mention the 00:14:12.000 --> 00:14:16.000 importance of this event: it was really a cultural event, as well as 00:14:16.000 --> 00:14:20.000 a technical event, in which people were thinking about shared values that were 00:14:20.000 --> 00:14:24.000 distinctly different from things that you may have heard of around blockchains or 00:14:24.000 --> 00:14:28.001 cryptocurrencies. This was really a group of people that were concerned with how 00:14:28.001 --> 00:14:31.001 do you guarantee access to knowledge and how can you use decentralized 00:14:31.001 --> 00:14:36.001 technologies for things like preservation, which is our topic today. So the 00:14:36.001 --> 00:14:41.001 Starling Lab came out of really that rich tradition. And in working with the USC 00:14:41.001 --> 00:14:45.001 Shoah Foundation and Stanford's Department of Electrical Engineering, we've found 00:14:45.001 --> 00:14:49.000 this incredible opportunity to bring together experts to think about how we can 00:14:49.000 --> 00:14:54.001 deploy decentralized technologies to advance human rights. So we work with a 00:14:54.001 --> 00:15:01.000 number of different industry partners. And really the founding work and research 00:15:01.000 --> 00:15:04.000 that we did was with the USC Shoah Foundation's Visual History Archive. 00:15:04.001 --> 00:15:08.000 As Wendy mentioned, this is an archive that deals with the testimony 00:15:08.000 --> 00:15:09.001 of the survivors of genocide. 00:15:10.000 --> 00:15:15.000 It started 30 years ago by cataloging the stories of survivors of the Holocaust, 00:15:15.001 --> 00:15:19.001 but it expanded and now they're working on, I believe they're on their 14th 00:15:19.001 --> 00:15:21.001 genocide collection as of last week. 00:15:23.001 --> 00:15:27.001 And sadly, of course, that number continues to increase. There are over 00:15:27.001 --> 00:15:31.001 55,000 survivors' testimonies on average. 
It's about two and a half hours and 00:15:31.001 --> 00:15:37.000 several gigabytes for every testimony. So it's a massive four-petabyte 00:15:37.000 --> 00:15:42.000 collection. And currently it sits in three different data centers that are all 00:15:42.000 --> 00:15:47.000 state-of-the-art tape-based archival systems. But working with them, we really 00:15:47.000 --> 00:15:51.000 reimagined a continuum of preservation that goes beyond just their data centers 00:15:51.000 --> 00:15:57.000 that are maintained by the USC Shoah Foundation. Bravely, their CTO, Sam Gustman, 00:15:57.000 --> 00:16:02.000 has been working with us on figuring out a way to take the entire four petabytes 00:16:02.000 --> 00:16:07.000 of the Shoah Foundation's archive and put it onto the distributed web. And then 00:16:07.000 --> 00:16:10.001 in addition, I should mention that he's also looking at longer-term media 00:16:10.001 --> 00:16:14.001 storage, like storage on silica and DNA, et cetera. So they're quite innovative 00:16:14.001 --> 00:16:19.000 and progressive in thinking about preservation. The type of content that we've 00:16:19.000 --> 00:16:24.000 been working on is in looking at genocide testimonies. We've expanded our 00:16:24.000 --> 00:16:27.001 testimony collections with them to understand how the whole life cycle of 00:16:27.001 --> 00:16:34.001 testimony is collected and preserved and indexed. We've gone to Iraq, Los 00:16:34.001 --> 00:16:39.000 Angeles, the Amazon rainforest. We've been working in Syria on preserving 00:16:39.000 --> 00:16:44.000 testimony that can be potentially not only useful for humanitarian causes, but 00:16:44.000 --> 00:16:48.001 also for accountability as well. So we look at preservation for those purposes. 00:16:49.001 --> 00:16:52.001 And finally, we've been working with news organizations like Reuters to look at 00:16:52.001 --> 00:16:57.000 their archives. 
And most recently, we finished a project with them last year 00:16:57.000 --> 00:17:02.000 called the 78 Days, which catalogued each of the 78 days between the election and 00:17:02.000 --> 00:17:08.001 the inauguration. That included January 6th as well. What we found is that really 00:17:08.001 --> 00:17:12.000 what we're creating is a set of solutions is not just centered around 00:17:12.000 --> 00:17:17.000 preservation, but it's also around restoring trust. And so what I want to show 00:17:17.000 --> 00:17:21.000 you today is how we think about how that might work end to end. And really, it 00:17:21.000 --> 00:17:27.000 begins by following the natural life cycle of how you go and generate data, which 00:17:27.000 --> 00:17:30.001 would start with capturing. And then you move to storage. And then finally, 00:17:30.001 --> 00:17:36.000 verification. And it's with each of those steps that are critical to ensuring 00:17:36.000 --> 00:17:40.000 that you have a data set that could be trusted. And what I want to show you today 00:17:40.000 --> 00:17:44.000 is how decentralized systems can actually guarantee trust at each of these three 00:17:44.000 --> 00:17:49.001 stages. So let's begin with capturing on something like, let's say, a mobile 00:17:49.001 --> 00:17:54.000 phone. In our case, what we've done is we've taken not only the phone's ability 00:17:54.000 --> 00:17:58.000 to take a photo, but also looked at all the other sensor information that exists 00:17:58.000 --> 00:18:03.001 on the phone, like GPS for location, network information to establish a relative 00:18:03.001 --> 00:18:08.001 location, the gyroscope to understand the relative position of the camera, and 00:18:08.001 --> 00:18:11.001 even things like time and date, right? These are all critical pieces of metadata 00:18:11.001 --> 00:18:17.000 that the phone is able to generate. 
What we do is we take that metadata and we 00:18:17.000 --> 00:18:20.001 pair it with the image so that every time that you take a photo, you now have the 00:18:20.001 --> 00:18:27.000 payload, not only of the image pixels, but also of this metadata. Now, through 00:18:27.000 --> 00:18:31.001 our process of working with HTC, we've been able to take this payload and do 00:18:31.001 --> 00:18:35.000 something really special with it on the device, which is that we first of all 00:18:35.000 --> 00:18:39.000 create a hash of it on the device itself. And then we sign that hash with a 00:18:39.000 --> 00:18:43.000 cryptographic key that is guaranteed by firmware, which is on specialized 00:18:43.000 --> 00:18:48.000 hardware within the phones. So what that means really simply is that now you 00:18:48.000 --> 00:18:53.001 have a unique fingerprint of both the image and the metadata, and we've signed 00:18:53.001 --> 00:18:59.000 that so that we know that that fingerprint is secure. So with that payload of 00:18:59.000 --> 00:19:03.000 preservation information and all the metadata, we now take it and we put it on 00:19:03.000 --> 00:19:08.001 the decentralized web. So that step begins by first creating a CID, 00:19:08.001 --> 00:19:14.001 which is a unique identifier for that payload. And then we spread it out across 00:19:14.001 --> 00:19:19.000 the decentralized web, basically splitting it up into different pieces. And there we 00:19:19.000 --> 00:19:23.001 can store it onto different types of nodes. So you could imagine academic 00:19:23.001 --> 00:19:29.000 institutions, nonprofits, enterprise cloud, even small devices like the 00:19:29.000 --> 00:19:36.000 Raspberry Pi or a personal computer, even a phone. All of these different 00:19:36.000 --> 00:19:39.000 nodes are, in our minds, appropriate for storage because we want to diversify 00:19:39.000 --> 00:19:43.001 storage. And that's really critical to our framework. 
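The capture step described above, hashing the image together with its metadata and then signing that hash, might look roughly like the following. This is only a sketch under stated assumptions: it uses an HMAC with a shared secret as a stand-in for the hardware-backed signing key on the phone (a real device would use an asymmetric key held in secure hardware), and the function names, metadata fields, and sample values are all hypothetical.

```python
import hashlib
import hmac
import json


def payload_digest(image_bytes: bytes, metadata: dict) -> str:
    """Fingerprint the image and its metadata together as one payload."""
    # Serialize metadata deterministically so the same values give the same hash.
    meta_bytes = json.dumps(metadata, sort_keys=True, separators=(",", ":")).encode("utf-8")
    h = hashlib.sha256()
    # Length-prefix the image so the image/metadata boundary is unambiguous.
    h.update(len(image_bytes).to_bytes(8, "big"))
    h.update(image_bytes)
    h.update(meta_bytes)
    return h.hexdigest()


def sign_digest(digest_hex: str, device_key: bytes) -> str:
    """Stand-in signature: an HMAC over the digest (assumption, not the real scheme)."""
    return hmac.new(device_key, digest_hex.encode("ascii"), hashlib.sha256).hexdigest()


def verify_capture(image_bytes: bytes, metadata: dict, signature: str, device_key: bytes) -> bool:
    """Recompute the fingerprint and check it against the stored signature."""
    expected = sign_digest(payload_digest(image_bytes, metadata), device_key)
    return hmac.compare_digest(expected, signature)


# Hypothetical capture: placeholder pixel bytes and sensor readings.
device_key = b"example-device-key"
image = b"\xff\xd8 placeholder image bytes \xff\xd9"
metadata = {"gps": [40.7128, -74.0060], "timestamp": "2021-01-06T14:00:00Z", "gyro": [0.1, 0.0, 9.8]}

signature = sign_digest(payload_digest(image, metadata), device_key)
tampered_metadata = dict(metadata, timestamp="2021-01-07T14:00:00Z")
```

Changing either the pixels or any metadata field changes the digest, so `verify_capture` fails for the altered record while the original still verifies.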
In addition to that, we use 00:19:43.001 --> 00:19:48.000 cryptography and advanced proofs of spacetime, like the kind Filecoin has as an 00:19:48.000 --> 00:19:52.000 example, to ensure that as you spread information far and wide, you're also 00:19:52.000 --> 00:19:57.000 ensuring that its integrity is kept. And that if any of those nodes which takes 00:19:57.000 --> 00:20:01.001 the data manipulates the data, we now have a way of proving that in fact, that 00:20:01.001 --> 00:20:06.001 manipulation has occurred. Paradoxically, what this means is that as you spread 00:20:06.001 --> 00:20:10.001 information farther and wider, not only are you able to preserve the information 00:20:10.001 --> 00:20:15.000 better, but you're actually able to create a seal around the information. And 00:20:15.000 --> 00:20:19.000 with more and more nodes joining that network, the harder it is to break that 00:20:19.000 --> 00:20:24.001 seal. So that's our preservation story. But as you all know, in working in the 00:20:24.001 --> 00:20:28.001 archival space, it doesn't end there. Just because you have a record of something 00:20:28.001 --> 00:20:32.001 that you prove has not been manipulated, still the contents of the objects 00:20:32.001 --> 00:20:37.000 matter; they need to be examined and they need to be indexed. And so the expert 00:20:37.000 --> 00:20:41.001 certification of the content of information is something that is normally done 00:20:41.001 --> 00:20:43.001 through an archival process of indexing and verification. 00:20:44.000 --> 00:20:49.000 So we take those attestations, and those too, we also put on decentralized 00:20:49.000 --> 00:20:53.001 systems so that those records of the authenticity and the verification of the 00:20:53.001 --> 00:21:00.000 content itself can also be preserved on a decentralized system. So that becomes 00:21:00.000 --> 00:21:07.000 basically the three parts of our system, capture, store, and verify. 
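The idea that a manipulated copy can be detected once the expected fingerprint is known can be illustrated with a plain hash audit. To be clear about the assumption: this is not Filecoin's proof of spacetime, which is a far more involved cryptographic protocol; it is only a simplified sketch of comparing each node's copy against a known digest, and the node names and record bytes are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 hex digest of a stored record."""
    return hashlib.sha256(data).hexdigest()


def audit_replicas(expected: str, replicas: dict[str, bytes]) -> dict[str, bool]:
    """Map each node name to True if its copy still matches the expected digest."""
    return {node: digest(blob) == expected for node, blob in replicas.items()}


record = b"photo bytes plus signed metadata"
expected = digest(record)

# Hypothetical nodes holding copies; one of them has altered its copy.
replicas = {
    "university-node": record,
    "nonprofit-node": record,
    "home-raspberry-pi": record + b" (edited)",
}

report = audit_replicas(expected, replicas)
tampered = sorted(node for node, ok in report.items() if not ok)
```

Here `tampered` lists only the node whose bytes no longer hash to the expected value, and the intact copies on the other nodes can be used to repair it.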
So as a last 00:21:07.000 --> 00:21:09.000 step, I want to show you where all this stuff is stored. 00:21:09.001 --> 00:21:14.001 In working with Adobe and Microsoft and the Linux Foundation, we've been helping 00:21:14.001 --> 00:21:18.001 pioneer a set of standards called the C2PA, which allow you to take all this 00:21:18.001 --> 00:21:24.000 information and actually put it directly, as an example, with a JPEG inside the 00:21:24.000 --> 00:21:28.001 photograph itself. So that now the photograph becomes a universe of information, 00:21:28.001 --> 00:21:33.001 not only of image pixel data, but also of these cryptographic proofs and also of 00:21:33.001 --> 00:21:38.001 these verifications. So now if, for instance, I, in this example, can use a small 00:21:38.001 --> 00:21:42.001 app, I can click on this eye and I can see all of the information around the 00:21:42.001 --> 00:21:47.000 photograph and its metadata. And I can also see the links back to where this 00:21:47.000 --> 00:21:50.001 information sits on the decentralized web. All of this just contained inside of 00:21:50.001 --> 00:21:54.001 the JPEG. So we think that's pretty nifty because it now changes every photograph 00:21:54.001 --> 00:21:59.001 from being just a photo, a container of image pixels, to now being a universe of 00:21:59.001 --> 00:22:06.000 information for fact checking, for image verification, etc, etc. All right, I'm 00:22:06.000 --> 00:22:09.001 going to close very quickly by just going through our prototype and some of our 00:22:09.001 --> 00:22:13.000 learnings. So I described you a little bit about the work that we did with 00:22:13.000 --> 00:22:17.001 Reuters, but it really was an unfolding set of experiments during the course of 00:22:17.001 --> 00:22:23.000 the 2020 election. 
And what we did is we used our technology to go out with 00:22:23.000 --> 00:22:26.001 photojournalists at Reuters, and I'll show you end to end how we established this 00:22:26.001 --> 00:22:31.001 new form of digital trust. So it began by having photos from this professional 00:22:31.001 --> 00:22:35.000 grade camera move over to the phone, where it was notarized through the process 00:22:35.000 --> 00:22:39.000 I've described, and then it ends up in the CMS system at the Reuters headquarters 00:22:39.000 --> 00:22:42.001 at their photo desk in London. And then from there, we took that information, 00:22:43.000 --> 00:22:48.000 which included again, not only the photo, but also things like location and a 00:22:48.000 --> 00:22:52.001 hash of the image, all that complex metadata, and we were able to syndicate it 00:22:52.001 --> 00:22:59.000 out to different decentralized systems. So in this case, the first step was to 00:22:59.000 --> 00:23:03.001 syndicate it out to a private permissioned system with IBM. So this is a 00:23:03.001 --> 00:23:07.001 form of blockchain technology that's called a private permissioned ledger. So 00:23:07.001 --> 00:23:10.001 that's the first step. And then the second step was we put it on a public 00:23:10.001 --> 00:23:14.000 permissionless ledger, which is similar to something you can think of almost like 00:23:14.000 --> 00:23:18.001 Bitcoin, where we were able to store a hash of that information also out on the 00:23:18.001 --> 00:23:22.000 public web. So this allowed you to preserve privacy, and 00:23:22.000 --> 00:23:23.001 also have a public system of verification. 00:23:25.000 --> 00:23:27.000 I think we could have never imagined what we were actually going to 00:23:27.000 --> 00:23:29.000 capture during those 78 days. 00:23:29.001 --> 00:23:31.001 I think they caught all of us by surprise in terms of 00:23:31.001 --> 00:23:33.000 how historic they proved to be. 
00:23:34.000 --> 00:23:39.000 But what I can say is that the efforts of our technology development 00:23:39.000 --> 00:23:43.000 certainly weighed very heavily with us as we thought about what we were 00:23:43.000 --> 00:23:47.001 doing and helping think about the restoration of trust. Because surely I think we 00:23:47.001 --> 00:23:50.001 can all agree, no matter what side of the aisle we're on, that the demonization 00:23:50.001 --> 00:23:54.001 of the free press is something that we should all strive to end. 00:23:55.001 --> 00:23:59.001 And hopefully, the work that we're doing in creating an archive that can sustain 00:24:00.000 --> 00:24:05.000 the challenges of misinformation and the challenges of manipulation through 00:24:05.000 --> 00:24:09.001 social media is a really good step in that direction. And so you can check out 00:24:09.001 --> 00:24:14.001 the website yourself, starlinglab.org, 78 Days. It'll give you a chance to play 00:24:14.001 --> 00:24:18.001 around with the archive and also see more in-depth explanations of the 00:24:18.001 --> 00:24:25.001 technology. So I'll wrap up here with the last minute about our learnings. You 00:24:25.001 --> 00:24:29.001 matter. It's probably the biggest thing I can mention, which is that institutions 00:24:29.001 --> 00:24:33.001 like libraries and archivists are a key part of creating a solution that is 00:24:33.001 --> 00:24:39.000 networked, and that as a community, if we can all come together to guarantee the 00:24:39.000 --> 00:24:42.001 integrity of information, we're in a unique position to create a new foundation 00:24:42.001 --> 00:24:47.000 of digital trust. So it takes that form of collaboration, and that really when we 00:24:47.000 --> 00:24:51.000 think about decentralization, it's not a single destination, but it's an 00:24:51.000 --> 00:24:55.000 unfolding process in which we continually strive to bring more and more diverse 00:24:55.000 --> 00:25:00.001 nodes into our system.
And the more diverse those nodes are, the more that 00:25:00.001 --> 00:25:04.001 they're going to be able to store and verify information. And so that's really 00:25:04.001 --> 00:25:08.001 why you might think of multiple ledgers and multiple decentralized systems coming 00:25:08.001 --> 00:25:14.000 into play, because they can allow for a tremendous amount of diversification of 00:25:14.000 --> 00:25:18.001 cryptographic features, of performance, methods of preservation, and last, of 00:25:18.001 --> 00:25:25.001 course, diversity use. Think of decentralization a lot like biodiversity. This is 00:25:25.001 --> 00:25:30.000 how we get resilience as a community, and both at a technical level and also at a 00:25:30.000 --> 00:25:33.000 community level. Right. With that, I'll pass it back 00:25:33.000 --> 00:25:34.001 to Wendy. Thanks so much for having me. 00:25:36.000 --> 00:25:40.000 Thank you, Jonathan. We have some questions, some really good questions. So one 00:25:40.000 --> 00:25:43.000 question is, how does this differ actually from 00:25:43.000 --> 00:25:45.001 BitTorrent, which is a very good question? 00:25:46.000 --> 00:25:49.001 There's a lot of similarities, actually. So BitTorrent works by syndicating 00:25:49.001 --> 00:25:54.000 information across multiple different nodes. Some of the big differences in our 00:25:54.000 --> 00:25:57.000 work is that we choose nodes. 00:25:57.000 --> 00:26:01.001 So whereas BitTorrent is meant to be diffuse and random with how information is 00:26:01.001 --> 00:26:06.000 spread across and it's optimized basically at the protocol level, we think about 00:26:06.000 --> 00:26:11.001 the decentralization process as something that we want archives to have a role in 00:26:11.001 --> 00:26:16.001 choosing which nodes they distribute their information. And so that's a major 00:26:16.001 --> 00:26:21.000 distinction. Is your tech open source? And can you point us 00:26:21.000 --> 00:26:27.001 to a open source technology? 
00:26:28.000 --> 00:26:33.000 And our prototypes are we're in the process of putting out various parts of our 00:26:33.000 --> 00:26:37.000 code base, but really we haven't created any novel technology. We've just created 00:26:37.000 --> 00:26:42.000 novel implementation. So I'll be very happy to refer you over to our website. And 00:26:42.000 --> 00:26:46.000 if you want to reach out, I can give you a list of the different protocols that 00:26:46.000 --> 00:26:47.001 we've used. And all of those are open source. 00:26:48.000 --> 00:26:52.000 And we are very firmly committed to being a part of an open source ecosystem, 00:26:52.000 --> 00:26:57.000 both as contributors and also publishers. So Jonathan, what's the name of that 00:26:57.000 --> 00:26:59.001 JPEG embedded metadata standard? 00:27:00.000 --> 00:27:06.001 Librarians are very keen on helping to create better metadata. Sure. So the link 00:27:06.001 --> 00:27:11.000 is actually there. I see Heather's put it in, which is the C2PA. And I'd really, 00:27:12.000 --> 00:27:16.000 there's a very welcoming and open environment there for people to weigh in. I 00:27:16.000 --> 00:27:20.000 think archivists are a key part of helping us come up with a standard that's 00:27:20.000 --> 00:27:26.000 going to be useful for them. So we'd be really happy for people to contribute to 00:27:26.000 --> 00:27:29.001 that standard. It's based out of the Linux Foundation. So it too 00:27:29.001 --> 00:27:31.000 has open source commitments. 00:27:33.000 --> 00:27:40.000 So someone said, are you licensing software? As an organization, no. We're a lab 00:27:40.000 --> 00:27:45.001 that's experimenting to help create some of the art of the possible. And we have 00:27:45.001 --> 00:27:49.001 various partners that we work with. Almost all of them are fully 00:27:49.001 --> 00:27:51.000 transparent and open source in their work. 00:27:51.001 --> 00:27:55.001 That's a key criteria in working with them. 
And in that way, there are really no 00:27:55.001 --> 00:28:00.000 complexities with the licensing. You can use it, you can fork it, et cetera. 00:28:00.000 --> 00:28:04.001 Nicholas Taylor mentions that I had brought up incentives and contracts as 00:28:04.001 --> 00:28:09.001 mechanisms for ensuring persistence. Can you elaborate on how persistence is 00:28:09.001 --> 00:28:16.000 assured or supported in the Starling frameworks? Sure. So remember, we're a 00:28:16.000 --> 00:28:18.000 framework that allows you to help make better choices. 00:28:18.000 --> 00:28:23.000 And we use a variety of different protocols. And each of those protocols, we 00:28:23.000 --> 00:28:26.000 don't endorse them as best practices, but we're experimenting with them to 00:28:26.000 --> 00:28:31.000 understand how they could achieve persistence. I'd say that if you look at 00:28:31.000 --> 00:28:37.000 currently what's out there, I'd caution people that there are some big promises 00:28:37.000 --> 00:28:41.000 that are being made about immutability and persistence and permanence. 00:28:41.001 --> 00:28:46.001 We as a lab try to avoid those words, because we're concerned that with any of 00:28:46.001 --> 00:28:50.001 these technologies in the communities, history shows that really 00:28:50.001 --> 00:28:52.001 nothing can be guaranteed to be permanent. 00:28:53.001 --> 00:28:57.000 And so it really takes active efforts to ensure that type of thing. Now, what's 00:28:57.000 --> 00:29:01.001 new is that you really have these incentive layers that could potentially allow 00:29:01.001 --> 00:29:04.001 people to think about the creation of endowments, for instance, that could 00:29:04.001 --> 00:29:08.001 persist for years and years if they're really architected, and if the economics 00:29:08.001 --> 00:29:15.000 bear out.
So in all the cases, whether it's Filecoin or Arweave, people from 00:29:15.000 --> 00:29:18.000 Storj are here as well, they can talk to you about how you can use some of 00:29:18.000 --> 00:29:22.000 those incentives to help ensure that people that are hosting information are 00:29:22.000 --> 00:29:27.000 incentivized to do that long term. But the reality is that that's never a passive 00:29:27.000 --> 00:29:31.000 effort. The data owners and the archivists like you have to be involved in 00:29:31.000 --> 00:29:33.000 helping architect some of those best practices. 00:29:34.000 --> 00:29:37.000 And you shouldn't gloss over the details, because it's really important that 00:29:37.000 --> 00:29:41.001 everyone understand what are the incentive mechanisms and the security mechanisms 00:29:41.001 --> 00:29:47.001 there. We have some very knowledgeable questioners here. Kiernan says, as 00:29:47.001 --> 00:29:52.000 someone more familiar with LTO storage and trusting the hashes and bag manifests, 00:29:52.000 --> 00:29:59.000 is the idea here that these are not trustworthy enough in certain contexts? To be 00:29:59.000 --> 00:30:02.001 clear, I'm not as familiar with LTO storage, so you can help enlighten Kiernan. 00:30:03.000 --> 00:30:06.000 But what we found is that typically, most archiving 00:30:06.000 --> 00:30:08.000 organizations will just have hashes. 00:30:08.001 --> 00:30:13.001 They'll just store hashes like a SHA-256 of their underlying data. And that is 00:30:13.001 --> 00:30:18.001 not enough, because unless you sign that information, you really don't have a way 00:30:18.001 --> 00:30:22.000 of protecting those hashes and ensuring that they have integrity. So we're 00:30:22.000 --> 00:30:27.001 providing not only hashing and signing, but then also a way of putting that 00:30:27.001 --> 00:30:31.001 information on a decentralized ledger. So think about it as like the belt and 00:30:31.001 --> 00:30:35.000 suspenders in this case.
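The difference between storing a bare hash and storing a signed hash can be illustrated in a few lines. This is a minimal sketch only, not the Starling implementation: it assumes a shared secret key and uses HMAC-SHA-256 from Python's standard library as a stand-in for the public-key signatures a real system would more likely use. The function names and the key are hypothetical.

```python
import hashlib
import hmac


def content_hash(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def sign_hash(digest: str, key: bytes) -> str:
    """Authenticate a digest with HMAC-SHA-256 (a stand-in for a real signature)."""
    return hmac.new(key, digest.encode("utf-8"), hashlib.sha256).hexdigest()


def verify(data: bytes, digest: str, tag: str, key: bytes) -> bool:
    """Check that the data matches the digest and that the digest matches the tag."""
    data_ok = hmac.compare_digest(content_hash(data), digest)
    tag_ok = hmac.compare_digest(sign_hash(digest, key), tag)
    return data_ok and tag_ok


key = b"hypothetical-archive-key"  # assumption: held only by the archive
original = b"photo bytes"
digest = content_hash(original)
tag = sign_hash(digest, key)

# Someone who swaps the file can also recompute a bare hash for it,
# but cannot produce a matching tag without the key.
tampered = b"altered photo bytes"
forged_digest = content_hash(tampered)

genuine_ok = verify(original, digest, tag, key)
forged_ok = verify(tampered, forged_digest, tag, key)
```

With a bare hash, the tampered file and its recomputed hash would look internally consistent; the extra authentication step is what lets the check fail.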
But we're not taking anything for granted about the 00:30:35.000 --> 00:30:39.000 integrity of the hash. Instead, we are finding multiple layers of trust that we 00:30:39.000 --> 00:30:42.001 can put on top of the hash, so that we can all ensure that when we look back, 00:30:42.001 --> 00:30:46.001 let's say in 50 years, that we know that that hash was actually properly created, 00:30:46.001 --> 00:30:51.001 and it was secured over time. Jonathan, I don't know if you've ever thought of 00:30:51.001 --> 00:30:55.001 this, but you're speaking to a lot of people from memory institutions like 00:30:55.001 --> 00:31:01.000 libraries, museums. Looking in the future, where do you see decentralized storage 00:31:01.000 --> 00:31:03.001 applied in their world? 00:31:06.000 --> 00:31:09.001 I like to think about it when I talk to the folks at the Shoah Foundation who are 00:31:09.001 --> 00:31:14.000 on the archiving side. I like to put their mind at ease and say, I think this is 00:31:14.000 --> 00:31:19.000 a backup to the backup. And what I mean by that is the starting point is that 00:31:19.000 --> 00:31:23.000 this is really cold storage and it's diffuse. So that means it's going to take 00:31:23.000 --> 00:31:27.001 time to reconstitute these types of archives if we need to have a restore event. 00:31:28.000 --> 00:31:32.000 And that's okay, because actually that's a great form of resilience, is to think 00:31:32.000 --> 00:31:37.001 about how you can diversify organizations and geography. And if that takes a 00:31:37.001 --> 00:31:42.001 little bit longer, to get this backup of a backup back in your hands, I'd argue 00:31:42.001 --> 00:31:44.000 to you that that's still really valuable. 
00:31:45.000 --> 00:31:49.000 Having been part of many technology organizations over the last 20 years, I can't 00:31:49.000 --> 00:31:52.000 tell you how many times we've been in a situation where we've trusted our vendor 00:31:52.000 --> 00:31:57.000 and trusted all the preparations we've made. And in the end, the server that was 00:31:57.000 --> 00:32:01.000 still standing was the one that was offline in the middle of nowhere that someone 00:32:01.000 --> 00:32:06.000 forgot even existed. Those are the types of things that can be essentially that 00:32:06.000 --> 00:32:09.000 type of serendipity is something you don't want to bank on. Instead, you want to 00:32:09.000 --> 00:32:14.001 actually think a little bit ahead. And these types of systems right now in their 00:32:14.001 --> 00:32:19.000 current state really can function in that way. They can be part of, I would say 00:32:19.000 --> 00:32:25.000 they're outside of your traditional and your performant forms of storage, but 00:32:25.000 --> 00:32:29.001 instead are a new way to think about preservation. And as these technologies get 00:32:29.001 --> 00:32:34.000 more mature, then we can start to move them up in our priority and reliability. 00:32:34.001 --> 00:32:37.001 Thanks so much for joining us and for the great work you're doing 00:32:37.001 --> 00:32:39.000 with so many different organizations. 00:32:40.000 --> 00:32:42.000 Likewise, Wendy, we're always inspired by you as 00:32:42.000 --> 00:32:44.000 well. Cheers. Thanks for having me. 00:32:44.000 --> 00:32:48.001 Okay, well, let's go on to see some demos. I mean, what Jonathan was talking 00:32:48.001 --> 00:32:54.001 about was cold storage, but what if you wanted active storage at scale? We're 00:32:54.001 --> 00:33:00.000 going to be showing you two projects that try to experiment with that. First, I'd 00:33:00.000 --> 00:33:04.001 like to introduce to you Arkady Kukarkin. 
He is one of the top D-Web engineers 00:33:04.001 --> 00:33:09.001 working today, and we are so honored and pleased that he works with us at the 00:33:09.001 --> 00:33:15.000 Archive. He was the founding CTO of an organization called Mediachain, which used 00:33:15.000 --> 00:33:20.001 blockchains to authenticate the provenance of music. And he also worked for 00:33:20.001 --> 00:33:26.001 Protocol Labs, which is the parent company of Filecoin. Now we gave Arkady this 00:33:26.001 --> 00:33:31.000 experiment to work on. Could you take a different type of data file, in this 00:33:31.000 --> 00:33:36.001 case, WARCs, or Web ARChive files, and could you store them at scale across the 00:33:36.001 --> 00:33:41.001 Filecoin network? And we chose this collection, the End of Term Archive from 00:33:41.001 --> 00:33:47.000 2016. Now that was at the end of the Obama administration, the beginning of the 00:33:47.000 --> 00:33:52.000 Trump administration, and it gathered together the entire federal presence, every 00:33:52.000 --> 00:33:58.001 .gov and .mil website at that time. It was a collaborative collection, the 00:33:58.001 --> 00:34:02.001 Library of Congress, Stanford, California Digital Library, and many institutions 00:34:02.001 --> 00:34:07.001 worked together with the Internet Archive to pull this together. It's about 200 00:34:07.001 --> 00:34:11.001 terabytes large. Now if you were going to replicate it three times, that's 00:34:11.001 --> 00:34:13.000 600 terabytes you need. 00:34:13.001 --> 00:34:20.000 It's about 20,000 items, a million files, billions of individual URLs. So Arkady, 00:34:20.000 --> 00:34:27.000 can you show us how you've been doing? Hello. So my name is Arkady Kukarkin, and 00:34:27.000 --> 00:34:34.000 I'm going to show you how our experiment here is going so far. And let's just 00:34:34.000 --> 00:34:40.001 get started. So we use two technologies here primarily, IPFS and Filecoin.
IPFS 00:34:40.001 --> 00:34:45.001 you can think of as a way to locate and retrieve content through a peer-to-peer 00:34:45.001 --> 00:34:51.000 network, and Filecoin you can think of as a way to ensure, or at least attempt to 00:34:51.000 --> 00:34:57.001 ensure, the long-term preservation of that content. So probably the best way to 00:34:57.001 --> 00:35:03.001 dive in is to just look at a simple example. So I have here 00:35:03.001 --> 00:35:09.000 IPFS enabled in my browser, it's in Brave, but you can also install an extension 00:35:09.000 --> 00:35:12.000 to do this in any other browser as well. 00:35:13.000 --> 00:35:18.000 And we can take a look at my node here. So here's some stats, but the most 00:35:18.000 --> 00:35:23.000 interesting thing is probably the peer list, which may take a second to populate, 00:35:23.001 --> 00:35:30.001 but you can see I'm connected to almost 1400 peers throughout the world. And as 00:35:30.001 --> 00:35:37.000 they're coming up now, we actually see some in Russia and Ukraine as well, 00:35:37.001 --> 00:35:41.000 which is an interesting demonstration of the resiliency of these peer-to-peer 00:35:41.000 --> 00:35:46.000 connections, because as you know, web traffic to those places 00:35:46.000 --> 00:35:47.001 is currently disrupted. 00:35:48.001 --> 00:35:53.000 So let's take a look at just a simple image file here on the Metro website. 00:35:54.000 --> 00:35:59.001 We can import it into IPFS just like you could any normal file. 00:36:00.000 --> 00:36:01.001 Okay, here it is. 00:36:03.001 --> 00:36:10.000 And let's take a look. So IPFS, bam. So here's our 00:36:10.000 --> 00:36:15.001 image, and you can see a sort of funny-looking URL here at the top. Hopefully you 00:36:15.001 --> 00:36:22.000 can read that, but instead of HTTP, we have IPFS, and then we have this sort of 00:36:22.000 --> 00:36:28.000 scary-looking long identifier.
And what happened here is that the file was loaded 00:36:28.000 --> 00:36:34.000 into my local node and hashed and made available to the entire IPFS network. 00:36:35.000 --> 00:36:42.000 So if anyone, pretty much anywhere in the world, were to enter this IPFS URL, 00:36:42.001 --> 00:36:46.001 they would be able to access this file, maybe from my machine, maybe from another 00:36:46.001 --> 00:36:50.000 machine that also happens to have the same one, maybe from an intermediate node, 00:36:50.001 --> 00:36:56.000 someone in that network of 1400 machines that I showed you. So I think this is 00:36:56.000 --> 00:37:03.000 already cool, because you're able to access a file simply by its 00:37:03.000 --> 00:37:08.000 identifier, the CID, that Jonathan mentioned already, without knowing or really 00:37:08.000 --> 00:37:14.000 caring where it came from. The reason that works is that the CID 00:37:14.000 --> 00:37:20.001 is actually, well, it's a little bit truncated here, but this 00:37:21.001 --> 00:37:26.001 long string is in fact an encoding of a content hash, which 00:37:26.001 --> 00:37:28.001 again, was mentioned by Jonathan. 00:37:29.000 --> 00:37:33.001 So we're not applying as rigid of a standard here. So it's not a signed hash. But 00:37:33.001 --> 00:37:38.001 nonetheless, if you request this particular identifier, you are pretty much 00:37:38.001 --> 00:37:44.001 guaranteed to get the exact same file back. So I think that's already pretty 00:37:44.001 --> 00:37:51.001 cool, because if we think about something like the lifetime of hyperlinks 00:37:51.001 --> 00:37:57.001 in a research paper, so this is just the graphic I pulled down. So after just a 00:37:57.001 --> 00:38:04.001 few years, something close to 50% of all hyperlinks across academic papers are no 00:38:04.001 --> 00:38:09.000 longer resolvable.
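The idea that an identifier derived from a content hash returns the same file no matter who serves it can be sketched with a toy store. This is an illustration under simplifying assumptions, not how IPFS is implemented: real CIDs wrap the hash in additional encoding and large files are split into linked blocks, whereas this sketch uses a plain hex SHA-256 digest as the identifier. The class and variable names are hypothetical.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key is derived from the bytes themselves."""

    def __init__(self) -> None:
        self._blocks: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        """Store the data and return its identifier (hex SHA-256 of the content)."""
        ident = hashlib.sha256(data).hexdigest()
        self._blocks[ident] = data
        return ident

    def get(self, ident: str) -> bytes:
        """Return the data for an identifier, re-checking it against the hash."""
        data = self._blocks[ident]
        if hashlib.sha256(data).hexdigest() != ident:
            raise ValueError("content does not match identifier")
        return data


# Two independent nodes that happen to hold the same file
node_a = ContentStore()
node_b = ContentStore()
image = b"example image bytes"

id_a = node_a.put(image)
id_b = node_b.put(image)

# The requester only needs the identifier, not the location
fetched = node_b.get(id_a)
```

Because both nodes derive the identifier from the bytes, they arrive at the same one independently, and a requester can verify whatever comes back by hashing it again.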
And maybe they exist elsewhere, let's say Internet Archive has 00:38:09.000 --> 00:38:13.000 archived a copy in the Wayback Machine, or someone else has a copy, but the 00:38:13.000 --> 00:38:20.000 actual link is broken, and needs to be manually fixed or followed, and for trust 00:38:20.000 --> 00:38:26.001 to be ensured. So imagine the same paper using these references instead of a 00:38:26.001 --> 00:38:31.001 traditional URL, it will just work as long as another copy is available in the 00:38:31.001 --> 00:38:38.001 network. So let's move on to a real example. So the data set 00:38:38.001 --> 00:38:42.001 that Wendy mentioned is the End of Term Web Archive, we're using the 2016 00:38:42.001 --> 00:38:49.000 version, which I think is probably a relatively hot set, as it were. 00:38:49.001 --> 00:38:56.001 And here's a copy that's just available on the web. And you can 00:38:57.001 --> 00:39:03.001 load a page here, so a little bit slow, but here we are, here's the Indianapolis 00:39:03.001 --> 00:39:10.001 FBI Bureau in fall of 2016. And here's 00:39:10.001 --> 00:39:17.000 what the backing data looks like. So this is just a whole lot of 00:39:17.000 --> 00:39:24.000 basically gigabyte-sized WARC web archives. And so just as before, we 00:39:24.000 --> 00:39:30.000 have the CID identifier. And we can pull it up. 00:39:30.001 --> 00:39:37.001 And in fact, we can actually load it into some tools that have already added 00:39:37.001 --> 00:39:43.000 IPFS loading support. So here's ReplayWeb.page, which is actually just a static 00:39:43.000 --> 00:39:49.001 file that loads from IPFS itself as well, and lets you browse the collection. 00:39:50.000 --> 00:39:57.000 So that's already pretty cool. So if you're a researcher, or an archivist, you 00:39:57.000 --> 00:40:02.001 may already de facto have a copy of this having accessed it. So we have lots of 00:40:02.001 --> 00:40:07.001 copies, they're keeping stuff safe. But is it safe enough?
I think in this case, 00:40:07.001 --> 00:40:12.000 it's actually probably not the case, because this is important data, but it's a 00:40:12.000 --> 00:40:18.000 very large amount of data. And it's data that will probably sit around not 00:40:18.000 --> 00:40:24.001 looked at, for the most part, until it's actually needed. So what do we do? 00:40:25.000 --> 00:40:31.001 Well, one solution is Filecoin. So we're using a tool called Estuary. 00:40:31.001 --> 00:40:37.000 Estuary is one of several clients for the Filecoin network. But what it does is 00:40:37.000 --> 00:40:43.000 essentially manage storage deals within Filecoin. Within Filecoin, the 00:40:44.000 --> 00:40:46.000 basic primitive is a deal. 00:40:47.000 --> 00:40:51.001 And it is made between you as the client and any number of storage providers. 00:40:52.000 --> 00:40:58.001 Here is a global map of the storage providers online currently. And 00:40:58.001 --> 00:41:02.001 at the end of the day, I care about where they're located. But because of the 00:41:02.001 --> 00:41:08.000 promises of the network and the protocol, I actually don't care who I'm talking 00:41:08.000 --> 00:41:13.001 to exactly, because the storage integrity is the protocol-level primitive. So 00:41:13.001 --> 00:41:20.000 here we have a 3x replication across some files. And we can take a look 00:41:20.000 --> 00:41:27.000 here. So the bright green is fully online. And some of these others have 00:41:27.000 --> 00:41:33.000 actually shown storage faults. And the Estuary system has now gone ahead and 00:41:33.000 --> 00:41:38.001 recreated these additional replicas. So they're now in the process known as 00:41:38.001 --> 00:41:42.001 sealing. And we can take a look. So here we have a provider. 00:41:45.000 --> 00:41:51.001 The provider, I don't actually know much about them, but we can take a look. Here 00:41:51.001 --> 00:41:58.000 they are. And this is a replica that we have in Montreal. So that's great.
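The replication and repair behaviour described above, where faulted replicas are replaced to get back to three copies, can be sketched as a simple count. This is a hypothetical illustration, not the logic of the tool used in the demo: the provider names, the status strings, and the function name are all assumptions made for the example.

```python
def plan_repairs(replicas: dict[str, str], target: int = 3) -> int:
    """Return how many new replicas are needed to get back to the target count.

    `replicas` maps a (hypothetical) provider name to its status string;
    only replicas marked "online" count toward the target.
    """
    healthy = sum(1 for status in replicas.values() if status == "online")
    return max(0, target - healthy)


# One of three replicas has shown a storage fault
replicas = {
    "provider-a": "online",
    "provider-b": "faulted",
    "provider-c": "online",
}
needed = plan_repairs(replicas)
```

A real client would follow this with new storage deals for the missing copies; the sketch only covers the decision about how many to create.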
00:41:59.000 --> 00:42:04.000 So I'd like to make a very quick note here, which is that "coin" might make you 00:42:04.000 --> 00:42:09.000 think of energy usage, of danger to the environment. And that is a 00:42:09.000 --> 00:42:11.000 very reasonable concern. 00:42:11.001 --> 00:42:15.000 So the important thing to realize with Filecoin is that it does not use the 00:42:15.000 --> 00:42:21.000 wasteful proof-of-work mechanism of Bitcoin. The actual ongoing data 00:42:21.000 --> 00:42:25.001 verification that happens at the protocol level also ensures the integrity of the 00:42:25.001 --> 00:42:27.001 network. You can read more about it here at this link. 00:42:28.000 --> 00:42:34.001 And you can look at the voluntary energy disclosures at filecoin.energy. So 00:42:34.001 --> 00:42:41.000 of course, there are many other systems that attempt to solve these 00:42:41.000 --> 00:42:47.000 problems as well. So there's IPFS Cluster, which is a sort of collaborative 00:42:47.000 --> 00:42:53.001 backup solution. There's Textile, which is a measure Filecoin client tool. 00:42:54.000 --> 00:43:00.001 There's Storj, which will be right up next. There's Arweave, which aims to 00:43:00.001 --> 00:43:06.001 achieve long-term or potentially infinite storage with a finite upfront cost, 00:43:06.001 --> 00:43:13.001 which is 00:43:13.001 --> 00:43:19.000 a public Estuary node that's already hosting hundreds of terabytes of data 00:43:19.000 --> 00:43:20.001 for its users. 00:43:21.000 --> 00:43:28.000 And I think that's it. Thank you. Thank you so much, Arkady. You'll be hanging 00:43:28.000 --> 00:43:32.001 out with us later if people have more questions and you could probably answer 00:43:32.001 --> 00:43:35.001 some questions right there in the chat. And we're going to come back to questions 00:43:35.001 --> 00:43:41.001 with you and Dominic. So let's move on, though, to our second demonstration.
I'd 00:43:41.001 --> 00:43:46.000 like to introduce you to Dominic Marino. He's the Senior Solutions Architect at 00:43:46.000 --> 00:43:51.000 Storj. Storj is probably the oldest decentralized storage company out there. 00:43:51.001 --> 00:43:56.000 And with Storj, the Internet Archive has been working to store LibriVox 00:43:56.000 --> 00:43:57.001 audiobooks at scale. 00:43:58.000 --> 00:44:01.001 So here to show us how that work is going, please welcome Dominic Marino. 00:44:03.000 --> 00:44:07.001 Thank you so much, Wendy. Very excited to be here speaking with everyone today. 00:44:08.000 --> 00:44:13.000 I'm Dominic, a Solutions Architect at Storj, and we're one of the leading 00:44:13.000 --> 00:44:19.001 providers of decentralized storage. We're very proud of our track record over the 00:44:19.001 --> 00:44:25.001 last, oh, goodness, it's been about eight years since Shawn Wilkinson founded us 00:44:25.001 --> 00:44:32.000 in 2014 in his dorm room. We've been really excited to work with the Internet 00:44:32.000 --> 00:44:38.001 Archive on decentralizing the LibriVox audiobook series. It's a collection of 00:44:38.001 --> 00:44:45.000 over 16,000 titles and approximately 22 terabytes of data. I've worked very 00:44:45.000 --> 00:44:52.000 closely with Arkady and have had a great time learning with him as we grow this 00:44:52.000 --> 00:44:57.001 at scale, bringing these massive collections into Storj. And I'm happy to show 00:44:57.001 --> 00:45:01.001 what we've done today. The first thing I'm going to do is tell you what we've 00:45:01.001 --> 00:45:05.000 done, and then I'm going to show you how we did it, give you an explanation of 00:45:05.000 --> 00:45:11.001 how our network functions. So over at Storj, we're a decentralized storage 00:45:11.001 --> 00:45:17.000 provider with over 13,000 nodes on our network, of which over 00:45:17.000 --> 00:45:19.000 9,000 are independent node operators.
00:45:19.001 --> 00:45:26.000 When you upload a file into our ecosystem, you encrypt it, then you split it, and 00:45:26.000 --> 00:45:30.001 then you distribute it out to those tens of thousands of nodes. This gives you 00:45:30.001 --> 00:45:35.001 ultimately the consumer, the control, and allows you to remain if you choose to, 00:45:36.000 --> 00:45:37.001 the custodian of the private key. 00:45:38.000 --> 00:45:44.000 We do in full disclosure work in both the Web 2 and Web 3 space. So we're engaged 00:45:44.000 --> 00:45:50.000 on a daily basis in Web 3 related activity projects in this space, as well as 00:45:50.000 --> 00:45:55.001 offering edge services that allow organizations in the Web 2 space to benefit 00:45:55.001 --> 00:46:00.000 from the inherent benefits of the Web 3 space. 00:46:00.001 --> 00:46:05.000 Meaning you can have a product today that uses something like Amazon's S3 00:46:05.000 --> 00:46:10.000 storage, and you can benefit from the redundancy, the redundancy, the 00:46:10.000 --> 00:46:15.000 performance, the value that decentralized storage brings you still in the Web 2 00:46:15.000 --> 00:46:21.000 space. So we're really focused in pushing forward, in being forward leaning, but 00:46:21.000 --> 00:46:26.000 still being able to have a very usable service by all different sorts of orders. 00:46:27.000 --> 00:46:31.001 I'm going to jump right into a quick demo and show you some things we've 00:46:31.001 --> 00:46:38.001 accomplished, as well as a very simple way to use our product. And to 00:46:38.001 --> 00:46:44.000 do that, I'm going to go through and do a quick demo of uploading a file here. So 00:46:44.000 --> 00:46:49.001 the first thing I'm going to do is pop over into our product, go to our bucket. 00:46:50.000 --> 00:46:53.000 This is not the way you need to interact with our network, but it's a way you can 00:46:53.000 --> 00:46:58.000 interact with our network. 
So today, I'm just going to go into this bucket, and 00:46:58.000 --> 00:47:02.001 I'm going to put in a super secure passphrase. I'm going to understand that I 00:47:02.001 --> 00:47:05.001 need to remember that passphrase because I'm the custodian of it, and the service 00:47:05.001 --> 00:47:12.000 will not remember it. I'm now in the bucket, and I'm going to upload that 00:47:12.000 --> 00:47:19.000 file. When that file uploads, I'm then going to create a share link, paste 00:47:19.000 --> 00:47:25.001 that share link in, and view it. Now this is an edge service we're running that 00:47:25.001 --> 00:47:32.001 allows you to share out items to anyone you wish. And I'm just going to post the 00:47:32.001 --> 00:47:38.000 link so Heather can post that link for you. And you can load this link. But this 00:47:38.000 --> 00:47:43.000 is, and it's hard to see, on 80 different, there's 80 different pieces, so 80 00:47:43.000 --> 00:47:49.000 different nodes. You can see the distribution around the pieces. And it's that 00:47:49.000 --> 00:47:53.001 easy. To show you what we've accomplished with the Internet Archive, I'm going to 00:47:53.001 --> 00:47:58.000 actually go through their main, the root of their site. I'm going to pop into the 00:47:58.000 --> 00:48:00.000 book collection, and then I'm going to go to their most 00:48:00.000 --> 00:48:02.000 popular book, The Art of War. 00:48:03.000 --> 00:48:09.000 The Art of War, for all of us that haven't recently read it or are unfamiliar with 00:48:09.000 --> 00:48:14.000 it, is a book really about avoiding war, right? War is failure. This is about 00:48:14.000 --> 00:48:21.000 taking diplomatic ties to dispersing conflict. So with 00:48:21.000 --> 00:48:25.000 the Internet Archive, we've uploaded these 16,000 plus assets.
00:48:26.000 --> 00:48:33.000 And thanks to Arkady, you can see that all assets related to 00:48:33.000 --> 00:48:39.000 this asset are available over at Storj, as well as available at the 00:48:39.000 --> 00:48:44.000 Internet Archive. So you can see how they're using us. It's a programmatic 00:48:44.000 --> 00:48:48.001 interaction, so they're able to batch upload. You can see how easy it is to just 00:48:48.001 --> 00:48:55.000 use our simple web UI to go through and upload an object and share it. 00:48:55.000 --> 00:49:00.000 And that is backed by a decentralized network. I'm going to hop back to the 00:49:00.000 --> 00:49:06.000 presentation, and then hop over to the next slide, which is a summary of what 00:49:06.000 --> 00:49:10.000 we've accomplished, a summary that you can stream on the right here. You can see 00:49:10.000 --> 00:49:13.001 what it looks like. We've made a mock in the center, as well as the list of items 00:49:13.001 --> 00:49:18.000 on the right-hand side. And then I'm just going to jump to a final slide and 00:49:18.000 --> 00:49:22.001 cover a few more things about the network. So what we're really excited about at 00:49:22.001 --> 00:49:27.000 Storj is that we're given the creative freedom to produce what we need to be 00:49:27.000 --> 00:49:32.000 successful, that is to build what people want. So when we're talking about things 00:49:32.000 --> 00:49:37.001 like IPFS, and Heather, I'm going to send you another link to share. This same 00:49:37.001 --> 00:49:44.000 image that we just uploaded has also been shared via an IPFS hash. I sent you 00:49:44.000 --> 00:49:50.001 a link that should be embedded in the chat now, the ipfsdemo.dev.storage.io, showing 00:49:50.001 --> 00:49:57.000 that not only is our storage decentralized, but it's content-addressable as well. 00:49:57.000 --> 00:50:00.000 Now that's something that's not in production today, but it's coming very soon.
00:50:00.000 --> 00:50:06.000 It's just so fantastic to be in an org that provides so much opportunity 00:50:06.000 --> 00:50:07.001 to build great things for tomorrow. 00:50:08.000 --> 00:50:10.000 As far as a little bit more detail, I see a 00:50:10.000 --> 00:50:11.001 question on how distribution occurs. 00:50:13.000 --> 00:50:19.001 We had a PhD economist actually build the model, right? So all of the nodes on 00:50:19.001 --> 00:50:23.000 our network, we don't run those nodes, by the way, those are people that come in 00:50:23.000 --> 00:50:27.000 and choose to run them, are incentivized to be good actors on the network. We 00:50:27.000 --> 00:50:31.001 can't trust that they will be, of course. So we have an audit and repair process 00:50:31.001 --> 00:50:37.001 that continually runs. That audit and repair process means that if a node drops 00:50:37.001 --> 00:50:42.000 off the network, or a node is misbehaving in the network, or a node is simply 00:50:42.000 --> 00:50:49.000 performing poorly, say the power is out, we can address that. We manage all 00:50:49.000 --> 00:50:55.001 repair. We manage all audit. There's no need to negotiate, for instance, and have 00:50:55.001 --> 00:51:01.000 maybe inconsistent pricing; you pay one price. And all of that is done behind a 00:51:01.000 --> 00:51:07.000 service level agreement, SLA, a contract where we guarantee a level of service to 00:51:07.000 --> 00:51:13.000 you. So we are a product today that you can use in your production application. 00:51:13.001 --> 00:51:17.001 You can get the benefits of that global distribution. 00:51:18.001 --> 00:51:23.001 If you want to be distributed, yet have data sovereignty, we do that as well. So 00:51:23.001 --> 00:51:29.000 if you, for instance, are trying to seek GDPR compliance, you want to be 00:51:29.000 --> 00:51:33.000 decentralized in Europe. You don't want the data anywhere else but the European 00:51:33.000 --> 00:51:38.000 Union, no problem. 
Conversely, if you're doing that in the United States for a 00:51:38.000 --> 00:51:44.001 reason, or Canada, no problem. So we're really, today, the only provider giving 00:51:44.001 --> 00:51:50.001 you that decentralized storage solution with data sovereignty. Highly usable, 00:51:51.000 --> 00:51:57.000 decentralized storage with multiple on-ramps, making it easy, as you've seen, for 00:51:57.000 --> 00:52:02.001 the Internet Archive to decentralize that large catalog of audiobooks. 00:52:05.000 --> 00:52:12.000 It's truly wonderful and I'm very fortunate to be here. Wendy, with that, I'm 00:52:12.000 --> 00:52:15.000 going to wrap and we can take care of the rest, of course, in Q&A. 00:52:16.001 --> 00:52:22.001 Great. Thank you so much. Let's call Arkady and Dominic back and we'll stop 00:52:22.001 --> 00:52:29.000 sharing the screen and answer a few of your questions. So one of the questions 00:52:29.000 --> 00:52:35.001 is, how can you really prove back from that hash that the originator did not fake 00:52:35.001 --> 00:52:42.001 the location? I guess, how do we know that the hash is really 00:52:42.001 --> 00:52:49.001 trustworthy? As for John, doing this conversation, presentation, you're 00:52:49.001 --> 00:52:55.000 very much right to ask that. So the hash is only as trustworthy as the context of 00:52:55.000 --> 00:53:00.001 its creation. So obviously, we end up with a certain need to establish a root 00:53:00.001 --> 00:53:07.001 of trust. So one way that I can see this working out is if you can imagine the 00:53:07.001 --> 00:53:11.001 Internet Archive as a catalog and as a data store. 00:53:12.000 --> 00:53:14.000 So right now, you need both places. 00:53:18.001 --> 00:53:25.000 Archival bond, I'm actually not familiar with this term. So imagine the catalog 00:53:25.000 --> 00:53:32.000 and the data store as sort of separate ideas. 
So if you trust the catalog, you 00:53:32.000 --> 00:53:39.000 don't necessarily have to trust the data store if they are linked through a 00:53:39.000 --> 00:53:44.000 cryptographically secure hash. So you can imagine, for example, a censorship-resistant 00:53:44.000 --> 00:53:49.000 Internet Archive where you only need to ensure the integrity by 00:53:49.000 --> 00:53:54.001 transmission of the catalog portion and then the data can be retrieved from any 00:53:54.001 --> 00:54:00.001 number of decentralized networks underlying it and that relationship is trust. So 00:54:00.001 --> 00:54:04.001 here's a question about decentralized storage working with digital preservation. 00:54:05.000 --> 00:54:11.001 How does it handle, for instance, file obsolescence? File obsolescence. So let's 00:54:11.001 --> 00:54:15.001 dig a little bit deeper into that. Is that the concept of a file not being 00:54:15.001 --> 00:54:19.001 necessary after a period of time? Is that how we want to think about it? Or are 00:54:19.001 --> 00:54:26.000 we thinking about the concept of maybe like bit rot? I would guess the first and 00:54:26.000 --> 00:54:32.001 maybe Dina can help us there. But let's say you don't need this file anymore. 00:54:32.001 --> 00:54:38.001 It's defunct. How easy is it to get rid of files, take them down? In other words, 00:54:39.001 --> 00:54:44.000 is it like Glacier where it's really hard to move things around? Or can you call 00:54:44.000 --> 00:54:46.001 and change things kind of at will? 00:54:47.000 --> 00:54:53.000 I can dive in on that. So at Storj, we say it's hot storage for the price of 00:54:53.000 --> 00:54:59.000 cold. So you don't need, for instance, auto-tiering or a lower-tier layer. We won't 00:54:59.000 --> 00:55:04.000 let you erasure code to a non-ideal ratio. We just do it right. 
That being said, 00:55:04.001 --> 00:55:09.001 however you handle your file management, and this is true for all storage 00:55:09.001 --> 00:55:16.001 backends, will be how you manage the archival and potential deletion of assets 00:55:16.001 --> 00:55:21.001 after a period of time. The storage backends generally wouldn't be responsible for 00:55:21.001 --> 00:55:28.001 that. It is managed in that archive. Well, with that, I think we are going to 00:55:28.001 --> 00:55:34.001 wrap up this session and ask everyone to join us at the next session for 00:55:34.001 --> 00:55:41.000 more to and fro. We hope you did enjoy what you heard today. And it was just a 00:55:41.000 --> 00:55:46.001 taste, a beginning. So please come back for more. We're doing this for six 00:55:46.001 --> 00:55:51.000 months, the last Thursday of every month (this is number two), at 1pm 00:55:51.000 --> 00:55:53.001 Pacific, 4pm Eastern. 00:55:53.001 --> 00:55:57.000 We have four more sessions here, three of them. The next one in March 00:55:57.000 --> 00:55:58.001 is on decentralized identity. 00:55:59.001 --> 00:56:05.000 And we've also, as mentioned, developed this really beautiful resource guide with 00:56:05.000 --> 00:56:10.001 different videos, links to other companies that do this, other 00:56:10.001 --> 00:56:12.000 organizations, deeper dive reading. 00:56:13.000 --> 00:56:17.001 So please take a look. We're dropping the link to that in the chat. It will be 00:56:17.001 --> 00:56:22.000 emailed to you if you registered for this. And please share it widely. That's why 00:56:22.000 --> 00:56:24.001 it's there. Finally, I just want to say thank you.